After Biden’s recent announcement of a federal student loan cancellation program, many people are pushing for universal student loan forgiveness. However, there is still a lot of uncertainty in this area. Looking at the topic from both a historical and a socioeconomic perspective can shed light on the contemporary discussion of student loan forgiveness policy.
This project explores how student debt and its repayment rate interact with socioeconomic conditions. Because this is a broad topic, we did a fair amount of data exploration before narrowing our research focus. The project involves three parts. First, we focused on specific aspects of the student loan issue, including university selectivity, family income, repayment rate over time, and regional differences in debt amounts, to understand why student debt has become an increasingly serious problem in the US. After looking at the historical data, we also wanted to understand the current discussion of student loan forgiveness in light of the Biden administration’s recent announcement of a $1 billion cancellation in student loans. Specifically, we looked at tweet keywords and examined whether there is an overlap between the states that are most concerned about student debt and the states where selective schools are located.
We worked with the 2019 College Scorecard (and with the 2010-2019 College Scorecard) to look at trends that we thought would be interesting. The following is the dataset, along with some basic columns we added for our analysis. We built it by obtaining and concatenating the College Scorecard data as downloaded directly from the NCES. Below are the columns we selected:
| variable | description |
|---|---|
| UNITID | IPEDS Unit ID |
| INSTNM | Institution Name |
| CITY | City |
| state | State |
| ZIP | Zip Code |
| DEBT_MDN | The median original amount of the loan principal upon entering repayment |
| ADM_RATE | Admission Rate |
| CONTROL | Public (1) and Private Nonprofit (2) vs. Private For-Profit (3) Schools |
| MN_EARN_WNE_P10 | Mean Earnings after 10 Years (not used due to incomplete data) |
| MD_EARN_WNE_P10 | Median Earnings after 10 Years (not used due to incomplete data) |
| UGDS_WHITE; UGDS_BLACK; UGDS_HISP; UGDS_ASIAN | Proportion of Undergrads (White, Black, Hispanic, Asian) |
| UGDS | Number of Undergrads |
| School | Same as INSTNM (disregard - used to filter out community colleges) |
| ADM_RATE_ALL_1 | Admission Rate by OPE ID (not used in the end) |
| Year, Year_Starting, Year_Ending | Year Starting (twice due to how the data was read in) and Year Ending period covered by the scorecard |
This dataset came with lat and lon columns, which made it somewhat easier to plot the universities without needing to geocode their locations.
We also looked at data using the R package developed by the Urban Institute; specifically, we examined factors such as repayment rates.
The Google Trends CSV file contains two columns: the first shows the month, from 2004-01 to 2021-04, and the second shows the number of Google searches for “student loan forgiveness” in that month. We hoped to observe changes in the number of monthly searches over time.
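As a rough illustration of working with a two-column file like this one, the Python sketch below parses monthly rows, finds the peak month, and computes month-over-month changes. The header names (`month`, `searches`) and all values are made up for the example; they are assumptions, not the contents of the actual file.

```python
import csv
import io

# Hypothetical excerpt in the two-column layout described above;
# the header names and values are invented for illustration.
SAMPLE_CSV = """month,searches
2020-11,31
2020-12,28
2021-01,64
2021-02,100
2021-03,72
"""

def load_monthly_counts(text):
    """Parse (month, count) rows into a list of (str, int) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["month"], int(row["searches"])) for row in reader]

def peak_month(series):
    """Return the (month, count) pair with the largest count."""
    return max(series, key=lambda pair: pair[1])

def month_over_month_change(series):
    """Return (month, change) pairs relative to the previous month."""
    return [
        (series[i][0], series[i][1] - series[i - 1][1])
        for i in range(1, len(series))
    ]

series = load_monthly_counts(SAMPLE_CSV)
print(peak_month(series))
print(month_over_month_change(series))
```

A line plot of `series` would then show the kind of change over time we were looking for.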
When we were further along in our data visualization strategy, we also used the Twitter API to obtain tweet data about student debt, searching for key hashtags, including #cancelstudentloandebt, #helpmiddleclassamericans, and #zeropercentintereststudentloans, as well as mentions of the terms ‘borrower defense’ and ‘loan forgiveness’. The tweet data comes to 2.4 MB. In the preprocessing steps, we selected only the following columns for analysis: “user”, “text”, “display_text_width”, “hashtags”, and “location”. We wanted to use the text column to calculate a sentiment score for each tweet and an average sentiment score for each state. The text width column helps us find out which state is most concerned about canceling student debt. Since our primary focus for the Twitter dataset is to relate these tweets to geography, and Twitter doesn’t provide sufficient geographical information for each tweet, we relied heavily on the users’ location information. This process involved dropping all NA values and using the Google API to geocode users’ locations to longitude and latitude. We then reverse geocoded the data to get state and county information.
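To make the scoring step concrete, here is a minimal Python sketch of a lexicon-based sentiment score per tweet, averaged by state. The tiny lexicon, the sample tweets, and the `state` field (standing in for the reverse-geocoded user location) are all assumptions for illustration; this is not the lexicon or pipeline used in the project.

```python
import re
from collections import defaultdict

# Toy lexicon: an assumption for illustration, not the project's lexicon.
LEXICON = {"relief": 1, "help": 1, "hope": 1, "burden": -1, "crisis": -1, "unfair": -1}

def sentiment_score(text):
    """Sum the lexicon values of the lowercase word tokens in a tweet."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(tok, 0) for tok in tokens)

def average_by_state(tweets):
    """Average the per-tweet scores within each state."""
    scores = defaultdict(list)
    for tweet in tweets:
        scores[tweet["state"]].append(sentiment_score(tweet["text"]))
    return {state: sum(vals) / len(vals) for state, vals in scores.items()}

# Made-up tweets; "state" stands in for the reverse-geocoded user location.
tweets = [
    {"state": "NY", "text": "Student debt is a crisis and an unfair burden"},
    {"state": "NY", "text": "Loan forgiveness would help"},
    {"state": "CA", "text": "Real hope for relief"},
]
print(average_by_state(tweets))
```

Summing word-level values is the simplest possible scoring rule; a real lexicon would cover far more vocabulary and may weight words differently.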
As discussed in the introductory section, we had far too many subtopics within our dataset and we wanted to better explore a specific sub-topic prior to planning our visualizations. Thus, we began with trying to identify patterns within our dataset and other related datasets.
We started by testing whether or not we would be able to generate a graph of where universities in the US are located. This was mostly to validate that the lat and lon columns were working properly and were not going to cause issues if we were to plot these institutions on a map.
We then started trying to look at some trends that could be compiled into our visualization project. Some themes we were interested in were race, socioeconomic status, student debt, and economic outcomes.
This section of our notebook focuses on the information we have chosen to present for our visualizations and how we arrived at our current iterations.
Because our project focuses on student debt, socioeconomic status, and economic outcomes, one of the first things we looked at was student debt at different levels of university selectivity in 2019. Our preliminary graph was strictly exploratory: we categorized schools from the College Scorecard by admission rate (filtering specifically for 2019), splitting them into the following categories:
These categories were then plotted to show median student debt of each institution.
This was not a very good visual:
Despite these large issues, it did indicate that elite and highly selective schools appear to exhibit lower levels of student debt on average. This was an interesting finding, and it relates to socioeconomic background: more selective schools are associated with better outcomes after graduation and also often admit a portion of legacy students and financially well-off students, who were able to receive tutoring for school and to devote time to activities (e.g., sports and leadership positions) that would allow them to gain entry into such schools more competitively.
Thus, based on this preliminary visual, we made a number of adjustments:
We also considered showing the distribution as a box / violin plot (see below). However, we concluded that it did not provide enough additional information and did not show the trend we are trying to highlight as clearly as we intended (namely, that elite and highly selective schools generally exhibited lower levels of median student debt).
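The grouping behind these plots can be sketched in a few lines of Python. The cut-offs and the two lower labels below are hypothetical (only "elite" and "highly selective" appear in the text), and the rows are made-up values using the `ADM_RATE` and `DEBT_MDN` column names from the College Scorecard. Non-numeric debt values, such as suppressed entries, are skipped.

```python
from collections import defaultdict
from statistics import median

# Hypothetical cut-offs and labels; the project's actual categories may differ.
BINS = [(0.10, "elite"), (0.25, "highly selective"),
        (0.50, "selective"), (1.00, "less selective")]

def selectivity(adm_rate):
    """Return the label of the first bin whose upper bound covers adm_rate."""
    for upper, label in BINS:
        if adm_rate <= upper:
            return label
    raise ValueError(f"admission rate out of range: {adm_rate}")

def to_float(value):
    """Convert a raw field to float, or None if it is missing or non-numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def median_debt_by_category(rows):
    """Group schools by selectivity and take the median of DEBT_MDN per group."""
    groups = defaultdict(list)
    for row in rows:
        rate, debt = to_float(row["ADM_RATE"]), to_float(row["DEBT_MDN"])
        if rate is None or debt is None:
            continue  # skip schools with missing or suppressed values
        groups[selectivity(rate)].append(debt)
    return {label: median(vals) for label, vals in groups.items()}

# Made-up rows for illustration only.
rows = [
    {"ADM_RATE": "0.06", "DEBT_MDN": "9000"},
    {"ADM_RATE": "0.08", "DEBT_MDN": "11000"},
    {"ADM_RATE": "0.45", "DEBT_MDN": "15000"},
    {"ADM_RATE": "0.80", "DEBT_MDN": "PrivacySuppressed"},
    {"ADM_RATE": "0.90", "DEBT_MDN": "17500"},
]
print(median_debt_by_category(rows))
```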
In order to understand whether or not students are having more trouble with their increasing debts, we wanted to visualize repayment rates over time. The data for this was very complicated:
| variable | description |
|---|---|
| unitid | unique school identifier |
| cohort_year | a group of students in a school who began repaying their loans in the same year |
| years_since_entering_repayment | the year of observation of the cohort since their first year |
| repay_rate | the proportion of students to make progress on their loans |
| repay_rate_ | same as repay rate, but the denominator is the fraction of students in various income brackets |
| repay_count | the number of students to make progress on their loans (the numerator of repay_rate) |
| repay_count_ | number of students to make progress on their loans by income bracket |
This means that each university has multiple observations for each cohort_year (one for each year of repayment), and the same information again divided by income bracket. This information was presented either as a count (the numerator) or as a proportion of all students entering the repayment phase of their student loans.
Therefore, we wanted to visualize how the proportion of students making progress on their loans varies over time. To do so, we took the median repayment rate for each cohort year in its first year of repayment and made the below plot:
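The calculation behind this first version of the plot can be sketched in Python as follows. The rows are made up, and coding the first year of repayment as `years_since_entering_repayment == 1` is an assumption about how the data labels that year; the column names follow the table above.

```python
from collections import defaultdict
from statistics import median

def median_first_year_rate(rows, first_year=1):
    """Median repay_rate across schools, per cohort_year, in the first year of repayment.

    first_year=1 is an assumed coding of the first observation year.
    """
    groups = defaultdict(list)
    for row in rows:
        if row["years_since_entering_repayment"] != first_year:
            continue  # keep only the first observation of each cohort
        if row["repay_rate"] is None:
            continue  # skip schools with no reported rate
        groups[row["cohort_year"]].append(row["repay_rate"])
    return {year: median(rates) for year, rates in sorted(groups.items())}

# Made-up observations: one row per school, cohort, and year of repayment.
rows = [
    {"unitid": 1, "cohort_year": 2010, "years_since_entering_repayment": 1, "repay_rate": 0.50},
    {"unitid": 2, "cohort_year": 2010, "years_since_entering_repayment": 1, "repay_rate": 0.75},
    {"unitid": 3, "cohort_year": 2010, "years_since_entering_repayment": 1, "repay_rate": 0.25},
    {"unitid": 1, "cohort_year": 2010, "years_since_entering_repayment": 3, "repay_rate": 0.875},
    {"unitid": 1, "cohort_year": 2011, "years_since_entering_repayment": 1, "repay_rate": 0.50},
    {"unitid": 2, "cohort_year": 2011, "years_since_entering_repayment": 1, "repay_rate": None},
]
print(median_first_year_rate(rows))
```

Because the median treats every school equally regardless of size, a small school counts as much as a large one in this summary.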
But, in order to control for variation in the number of students across schools, I remade the plot by
In addition to modifying the calculations, I made some changes in the visual representation:
These changes yielded the below plot:
The Google Trends website gives us data on the number of Google searches over time. Plotting the total number of searches over time helps us see changes in people’s interest in student loan forgiveness. The pattern clearly shows that interest in this topic spiked in 2021. We hypothesized that this is a result of the Biden administration’s recent announcement of a reinterpretation of a federal student loan cancellation program, which will result in $1 billion in student loan forgiveness. Since this announcement, more people have been pushing for a more progressive policy and asking President Biden to cancel student debt through executive action. However, there is still a lot of uncertainty about whether universal student loan forgiveness will be initiated via executive order or legislation.
When looking at the extracted words and associated sentiments, we looked first at some of the most frequent words (and their contribution to the sentiment score of the tweet itself):
We decided that the above bar graph could be better expressed differently (either through a map or through a wordcloud) and tried to do this instead.
Our first wordcloud included the primary words in the hashtags and some frequent words that we did not feel added enough meaning (for example, given the selected hashtags, we already know that the topic is student debt). We modified our wordcloud by removing these terms.
The modified wordcloud shows key words that appeared in the tweet data. The size of each word represents how frequently it appears in tweets. Some of the most noticeable ones include people, president, loans, job, college, pay, biden, education, and studentloanforgiveness.
Below is the initial choropleth map we wanted to put in our project. It shows the distribution of average sentiment score in each state. The popup window shows the state, the associated average sentiment score, the total word count, and the top tweet keywords. A darker color means the state has a higher average sentiment score.
We scrapped these in part because the choropleth itself (with statewide sentiment scores) was once again not very informative: it does not add to our story about how people generally are feeling. The map does not give enough information about how people feel in different parts of the country, and the sentiment score itself is hard for the viewer to conceptualize in a salient way.
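The word sizes in a wordcloud come from simple frequency counts. The Python sketch below counts words across tweets while dropping an exclusion list, which mirrors the removal step described above. The exclusion list and the sample tweets are assumptions for illustration, not the terms or data used in the project.

```python
import re
from collections import Counter

# Assumed exclusion list: generic stopwords plus terms already implied by the hashtags.
EXCLUDED = {"the", "a", "to", "of", "and", "is", "my", "student", "debt"}

def word_frequencies(texts, excluded=EXCLUDED):
    """Count lowercase word tokens across all texts, skipping excluded terms."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(tok for tok in tokens if tok not in excluded)
    return counts

# Made-up tweets for illustration only.
tweets = [
    "The president should cancel student debt",
    "College is expensive and people pay for years",
    "People want the president to act on student loans",
]
freq = word_frequencies(tweets)
print(freq.most_common(3))
```

The resulting counts can be passed to any wordcloud renderer, which scales each word by its frequency.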
We also looked at debt and racial breakdown. As shown here, we did not identify a very clear trend - as such, we left ‘race and student debt’ alone for the time being as we were attempting to work on scaling down to more specific topics.
Finally, we tried to take our tweet data and look at the frequency of the positive and negative words over time. After looking at this information, we noted that it did not add anything to our current information (as again sentiment scores can be abstract, and there were no trends based on sentiment identified); we chose not to include or develop these further.